Saturday, April 23, 2011

Example of modeling derivations

The last post discussed a style of scientific programming that models derivations symbolically to create a machine-checked chain from the starting equation to the final code.


This post will provide an example using the configuration integral for the canonical ensemble in statistical mechanics. I want to evaluate the integral using various methods, starting with grid-based integration. A grid-based method suffers from the curse of dimensionality: the computation time grows exponentially with the number of particles (which is why Monte Carlo methods are normally used). However, it can still be useful for checking Monte Carlo codes at small particle numbers. Also, the free energy (and all associated quantities) is computed directly, whereas it is more difficult to compute with MC methods. Finally, I'm curious how large a system could be computed on today's hardware.
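For reference, the configuration integral has the standard form (N particles in d spatial dimensions, potential energy U, and inverse temperature β = 1/k_BT):

```latex
Z_c = \int \cdots \int e^{-\beta U(\mathbf{r}_1, \ldots, \mathbf{r}_N)} \, d^d r_1 \cdots d^d r_N
```

A grid with m points per axis requires m^(dN) evaluations of the integrand, which is the exponential growth with particle number mentioned above.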


SymPy is the computer algebra system I'm using. The code for this example is available on GitHub in the derivation_modeling branch, in the prototype subdirectory.


The output (MathML and generated python) can be seen here.


The flow is as follows:


  1. Derive the composite trapezoidal rule, starting from the expression for a single interval

  2. Generate code for the composite trapezoidal rule

  3. Start from the configuration integral (which I've called the partition function in the documents) and transform it by specializing various parameters - two particles in two spatial dimensions, the Lennard-Jones interaction potential, etc. - until it is completely specified numerically.

  4. Generate code that evaluates this integral using the generated integration code from step 2



There's still a tremendous amount of work to do on this, but hopefully an outline of this style of programming is visible from the example.



  • The MathML pages have a lot of detailed small steps - it would be nice to have a collapsible hierarchy so all the little steps can be hidden.

  • Some abstractions should remain in place during the steps - in particular the potential. The expression for the integrand gets quite messy when the expression for the potential is inserted explicitly (and this is a simple case - more complex potentials would make it unreadable).

  • More backends for the code generation - C/C++ is the obvious choice in pursuit of higher performance. (The generated python can be run through Shed Skin, for another route to C++). Converting to Theano expressions might be a quick way to try it out on the GPU.


Thursday, April 07, 2011

Modeling derivations

A previous post described how I keep reference versions of important derivations in a document. I would like to move this idea out of LaTeX and into a format more amenable to machine checking. Doing so requires some structure for describing and modeling a derivation.


A derivation starts with some base equation, then proceeds through a series of steps transforming it into a form that can be solved numerically (a symbolic/analytic solution would be nice, but is not possible in most cases). The steps can be divided into three categories:



  1. Exact transformations - rearranging terms, replacing an expression with an equivalent form, the usual steps involved in a proof, etc.

  2. Approximations - using a truncated series expansion, replacing a derivative with a discretized version, etc.

  3. Specializations - specifying the physical or model parameters needed to make the equation numerically solvable - the number of spatial dimensions, number of particles, interaction potentials, etc.




At the end of the derivation, the result can be evaluated numerically using the facilities of the CAS (computer algebra system - I assume the derivation modeling has been implemented in one), if it is simple enough. More complex or time-consuming results (the more likely case) can be further transformed by code generation to C/Fortran (or some other execution system).
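SymPy already provides a route for this last step. As a sketch (the expression here is an arbitrary stand-in for a fully specialized result), its codegen utility emits a compilable C source/header pair directly from a symbolic expression:

```python
import sympy as sp
from sympy.utilities.codegen import codegen

x = sp.symbols('x')
integrand = sp.exp(-x**2)   # stand-in for a fully specialized expression

# Generate a C function named 'integrand' plus its header file.
(c_name, c_code), (h_name, h_code) = codegen(
    ('integrand', integrand), 'C', 'integrand', header=False)
print(c_name)    # integrand.c
print(c_code)    # contains: double integrand(double x) { ... }
```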



One important feature of this workflow is that every step in creating the program is checked by machine. This should greatly reduce transcription errors (dropped minus signs and factors of 2 - the errors that started this whole project in the first place). It also moves part of the process of creating scientific programs into the realm of 'ordinary software', making it subject to normal software engineering techniques - unit testing, etc. Testing scientific (numerical) software is a difficult problem - this should help make it (a little) easier and more reliable.


Some other work I've run across related to describing mathematics:

  • Structured derivations were developed for teaching proofs in high school mathematics.

  • OMDoc is a markup language for describing mathematical structures at a higher level than a single formula (a proof, for instance).